Multi-GPU #335

Merged

ngc92 merged 6 commits into main from ngc92/multi-gpu on Aug 27, 2025

Conversation

@ngc92 (Collaborator) commented Aug 23, 2025

Implement multi-GPU tasks.

Design: run_eval still calls only a single instance of eval.py, but if the task is multi-GPU, eval.py creates a pool of worker processes and calls the user op on each with a different rank argument.

In the current code, input generation and testing are also handled by each process independently; I'm not sure that's the right approach.
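A minimal, torch-free sketch of that design (the names `_user_op` and `run_task` are hypothetical, not the actual eval.py code): one worker process per rank, each invoked with a different rank argument.

```python
import multiprocessing as mp


def _user_op(rank: int, world_size: int) -> int:
    # stand-in for the user-submitted op; in eval.py each worker would
    # additionally bind to GPU `rank` and join a torch.distributed
    # process group before calling the real kernel
    return rank * 10 + world_size


def run_task(world_size: int) -> list:
    # pool of worker processes; each call gets its own rank argument
    with mp.Pool(processes=world_size) as pool:
        return pool.starmap(_user_op, [(r, world_size) for r in range(world_size)])
```

With this shape, the driver collects one result per rank, e.g. `run_task(4)` yields a list of four per-rank results.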

@Snektron mentioned this pull request Aug 25, 2025
import os
import torch.distributed as dist

world_size = test.args["world_size"]
# hard-coded rendezvous endpoint for the local process group
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "12356"
Member:

might want to set up a large random port, in case a job fails and the port doesn't get released
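One common way to avoid a stale hard-coded port (a sketch of the suggestion, not what the PR currently does) is to let the OS pick a free one and write that into MASTER_PORT before spawning workers:

```python
import socket


def find_free_port() -> int:
    # bind to port 0 and let the OS choose an unused TCP port;
    # note there is a small race window before the worker re-binds it
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

For example: `os.environ["MASTER_PORT"] = str(find_free_port())`.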

@msaroufim (Member) left a comment

This is a really cool PR; I don't think I've seen anyone recently manage process pools manually instead of just calling torchrun. I think this solution works quite nicely for us, though, because we don't care (yet) about multi-node or fault tolerance.

Idk if he has time, but just in case: @kiukchung, I would really appreciate your feedback on whether this is a reasonable way of supporting a job system where people submit distributed kernels.

The part I'm a bit worried about: the example right now can check that a kernel is globally correct by iterating over many ranks and making sure they're locally correct. That's probably not going to be true for any of the distributed AMD kernels. This is fine as long as we offload the burden of gathering all the outputs onto the user.

check_copy = _clone_data(data, rank)

# first, one obligatory correctness check
output = custom_kernel(_clone_data(data, rank))
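The cloning in this hunk guards against kernels that mutate their inputs; a torch-free sketch of the same pattern (the `check_once` driver and kernel signatures here are hypothetical, with `copy.deepcopy` standing in for `_clone_data`):

```python
import copy


def check_once(custom_kernel, ref_kernel, data):
    # keep a pristine copy so a kernel that mutates `data`
    # cannot corrupt the reference computation
    check_copy = copy.deepcopy(data)
    output = custom_kernel(copy.deepcopy(data))
    expected = ref_kernel(check_copy)
    return output == expected
```

Because each call gets its own clone, even an input-clobbering kernel is judged only on its returned output.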
@danielhua23 (Collaborator) commented Aug 26, 2025

I'd like to confirm: should users' custom_kernel be designed as an API that accepts and outputs single-rank data? My ref_kernel accepts all-rank data and outputs all-rank results. Is that a conflict, or does something need to change in my PR? https://github.com/gpu-mode/reference-kernels/pull/51/files#diff-4634bd7a4a47ab89859ee0db3f4f3f3c8123cf18981fb4d24b4655a412777013R240

Collaborator (Author):

I think, essentially, the user kernel should be what is called _worker in your code. I wanted to avoid passing tensor objects across process boundaries with the multiprocessing module; I see that in your example, you also transfer the tensor to the CPU before returning from _worker.

@danielhua23 (Collaborator) commented Aug 26, 2025

Yes, _worker is actually the kernel each rank runs. In your case, do you expect people to submit a custom_kernel that contains only the worker? Does that work with your infra? In other words, does submission.py (https://github.com/gpu-mode/reference-kernels/pull/51/files#diff-e7f1884ce42d6b30a70551a1a922b2547b8cb7b91e959a2ceb6f4471740dde95) work with your infra?

Collaborator:

Big thanks! I saw siro's commit to my PR, which is compatible with this PR.

@ngc92 (Collaborator, Author) commented Aug 26, 2025

> This is a really cool PR, I don't think I've seen anyone recently manually manage process pools as opposed to just calling torchrun. I think this solution works quite nicely for us though because we don't care (yet) about multi node or fault tolerance
>
> Idk if he has time but just in case @kiukchung would really appreciate your feedback on whether this is a reasonable way of supporting a job system where people submit distributed kernels
>
> The part I'm a bit worried about is the example right now can check that a kernel is globally correct by iterating over many ranks and making sure they're locally correct. That's probably not going to be true for any of the distributed AMD kernels. This is fine as long as we offload the burden of gathering all the outputs onto the user

My first attempt was to just torchrun eval.py, but that breaks our file-descriptor approach for result reporting. So I followed @danielhua23's example and did the process startup manually, which fit quite nicely with the existing code, which already had a process pool of one process anyway.

If you cannot do the check locally, there's nothing preventing us from having all-gather calls inside the validation function.
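The gather-then-validate idea can be sketched without torch (in real code this would be a dist.all_gather inside the validation function; everything named here, including `_local_output`, is hypothetical):

```python
import multiprocessing as mp


def _local_output(rank: int) -> int:
    # stand-in for one rank's kernel output
    return (rank + 1) ** 2

def validate_globally(world_size: int) -> bool:
    # gather every rank's output first, then run a single global check,
    # instead of requiring each rank to be checkable in isolation
    with mp.Pool(processes=world_size) as pool:
        gathered = pool.map(_local_output, range(world_size))
    # global property over all ranks: sum of squares 1..n
    n = world_size
    return sum(gathered) == n * (n + 1) * (2 * n + 1) // 6
```

The point of the sketch is only the shape: correctness is a property of the gathered, cross-rank result, not of any single rank's output.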

@ngc92 force-pushed the ngc92/multi-gpu branch 3 times, most recently from e5b7a0d to e2071bd on August 26, 2025 10:53
@kiukchung

> Idk if he has time but just in case @kiukchung would really appreciate your feedback on whether this is a reasonable way of supporting a job system where people submit distributed kernels

Managing your own processes is perfectly valid. torchrun handles things like error propagation, process-group gang semantics, better logging/console output, etc. That said, it was primarily built for training applications.

The use case here seems to be data-parallel eval on a single host (but multi-GPU), where the rank is used to partition the eval dataset, so perhaps gang semantics are not desired (e.g. torchrun will SIGTERM all workers if at least one worker fails, which is the behavior we want during training).

For completeness I'll mention that torchrun can be used programmatically (see: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L927-L930):

from torch.distributed.launcher.api import elastic_launch, LaunchConfig

config = LaunchConfig(
    role=f"eval_{name_of_task}",  # name the worker procs (mostly informational)
    min_nodes=1,
    max_nodes=1,  # set both min and max to 1 in your case (single host)
    nproc_per_node=num_gpus,
    rdzv_endpoint="127.0.0.1",
    rdzv_backend="c10d",
    max_restarts=0,  # do not restart the gang (fail on the first failure)
)

elastic_launch(config, entrypoint=main_fn)(main_fn_args)

@ngc92 force-pushed the ngc92/multi-gpu branch 2 times, most recently from 71c40c1 to 54f7716 on August 26, 2025 23:11
@github-actions bot commented Aug 26, 2025

Coverage report

Files with new statements: src/libkernelbot/consts.py, src/libkernelbot/task.py (lines 119, 162 missing coverage)

This report was generated by python-coverage-comment-action

@ngc92 force-pushed the ngc92/multi-gpu branch 3 times, most recently from 776931b to 7f8c48f on August 26, 2025 23:24
@ngc92 changed the title from [WIP] Multi-GPU to Multi-GPU on Aug 27, 2025
@ngc92 ngc92 merged commit aa3e6ae into main Aug 27, 2025
5 checks passed
@ngc92 ngc92 deleted the ngc92/multi-gpu branch August 27, 2025 18:07
4 participants